This notebook purpose is clearly understand to whetherAUS(predicition to whether rainy for AUS ) dataset. In here mostly used visualation for this. For prediction used Machine learning. In conclusion our ACC: %84 found. Description About Columns: Date: The date of observation Location: The common name of the location of the weather station MinTemp: The minimum temperature in degrees celsius MaxTemp: The maximum temperature in degrees celsius Rainfall: The amount of rainfall recorded for the day in mm Evaporation: The so-called Class A pan evaporation (mm) in the 24 hours to 9am Sunshine: The number of hours of bright sunshine in the day WindGustDir: The direction of the strongest wind gust in the 24 hours to midnight WindGustSpeed: The speed (km/h) of the strongest wind gust in the 24 hours to midnight WindDir9am: Direction of the wind at 9am WindDir3pm: Direction of the wind at 3pm WindSpeed9am: Wind speed (km/hr) averaged over 10 minutes prior to 9am WindSpeed3pm: Wind speed (km/hr) averaged over 10 minutes prior to 3pm Humadity9am: Humidity (percent) at 9am Humadity3pm: Humidity (percent) at 3pm Pressure9am: Atmospheric pressure (hpa) reduced to mean sea level at 9am Pressure3pm: Atmospheric pressure (hpa) reduced to mean sea level at 3pm Cloud9am: Fraction of sky obscured by cloud at 9am. This is measured in "oktas", which are a unit of eigths. It records how many eigths of the sky are obscured by cloud. A 0 measure indicates completely clear sky whilst an 8 indicates that it is completely overcast Cloud3pm: Fraction of sky obscured by cloud (in "oktas": eighths) at 3pm Temp9am: Temperature (degrees C) at 9am Temp3pm: Temperature (degrees C) at 3pm RainToday: Boolean: 1 if precipitation (mm) in the 24 hours to 9am exceeds 1mm, otherwise 0 RainTomorrow: The amount of next day rain in mm. Used to create response variable RainTomorrow. A kind of measure of the "risk".

In [37]:
# importing libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from scipy import stats
from sklearn.model_selection import train_test_split
import plotly.express as px
import plotly.express as px
%matplotlib inline
In [2]:
# Calling the dataset
df = pd.read_csv('weatherAUS.csv')
df.set_index('Date', inplace=True)
In [3]:
# Checking the first 5 rows to find out the missing data. there is a bunch of missing data in the dataset
df.head(5)
Out[3]:
Location MinTemp MaxTemp Rainfall Evaporation Sunshine WindGustDir WindGustSpeed WindDir9am WindDir3pm ... Humidity9am Humidity3pm Pressure9am Pressure3pm Cloud9am Cloud3pm Temp9am Temp3pm RainToday RainTomorrow
Date
2008-12-01 Albury 13.4 22.9 0.6 NaN NaN W 44.0 W WNW ... 71.0 22.0 1007.7 1007.1 8.0 NaN 16.9 21.8 No No
2008-12-02 Albury 7.4 25.1 0.0 NaN NaN WNW 44.0 NNW WSW ... 44.0 25.0 1010.6 1007.8 NaN NaN 17.2 24.3 No No
2008-12-03 Albury 12.9 25.7 0.0 NaN NaN WSW 46.0 W WSW ... 38.0 30.0 1007.6 1008.7 NaN 2.0 21.0 23.2 No No
2008-12-04 Albury 9.2 28.0 0.0 NaN NaN NE 24.0 SE E ... 45.0 16.0 1017.6 1012.8 NaN NaN 18.1 26.5 No No
2008-12-05 Albury 17.5 32.3 1.0 NaN NaN W 41.0 ENE NW ... 82.0 33.0 1010.8 1006.0 7.0 8.0 17.8 29.7 No No

5 rows × 22 columns

In [4]:
# Checking columns
df.columns
Out[4]:
Index(['Location', 'MinTemp', 'MaxTemp', 'Rainfall', 'Evaporation', 'Sunshine',
       'WindGustDir', 'WindGustSpeed', 'WindDir9am', 'WindDir3pm',
       'WindSpeed9am', 'WindSpeed3pm', 'Humidity9am', 'Humidity3pm',
       'Pressure9am', 'Pressure3pm', 'Cloud9am', 'Cloud3pm', 'Temp9am',
       'Temp3pm', 'RainToday', 'RainTomorrow'],
      dtype='object')
In [5]:
df.values
Out[5]:
array([['Albury', 13.4, 22.9, ..., 21.8, 'No', 'No'],
       ['Albury', 7.4, 25.1, ..., 24.3, 'No', 'No'],
       ['Albury', 12.9, 25.7, ..., 23.2, 'No', 'No'],
       ...,
       ['Uluru', 5.4, 26.9, ..., 26.1, 'No', 'No'],
       ['Uluru', 7.8, 27.0, ..., 26.0, 'No', 'No'],
       ['Uluru', 14.9, nan, ..., 20.9, 'No', nan]], dtype=object)
In [6]:
# checking shape
df.shape
Out[6]:
(145460, 22)
In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Index: 145460 entries, 2008-12-01 to 2017-06-25
Data columns (total 22 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   Location       145460 non-null  object 
 1   MinTemp        143975 non-null  float64
 2   MaxTemp        144199 non-null  float64
 3   Rainfall       142199 non-null  float64
 4   Evaporation    82670 non-null   float64
 5   Sunshine       75625 non-null   float64
 6   WindGustDir    135134 non-null  object 
 7   WindGustSpeed  135197 non-null  float64
 8   WindDir9am     134894 non-null  object 
 9   WindDir3pm     141232 non-null  object 
 10  WindSpeed9am   143693 non-null  float64
 11  WindSpeed3pm   142398 non-null  float64
 12  Humidity9am    142806 non-null  float64
 13  Humidity3pm    140953 non-null  float64
 14  Pressure9am    130395 non-null  float64
 15  Pressure3pm    130432 non-null  float64
 16  Cloud9am       89572 non-null   float64
 17  Cloud3pm       86102 non-null   float64
 18  Temp9am        143693 non-null  float64
 19  Temp3pm        141851 non-null  float64
 20  RainToday      142199 non-null  object 
 21  RainTomorrow   142193 non-null  object 
dtypes: float64(16), object(6)
memory usage: 25.5+ MB
In [8]:
# showing the values of RainToday column
df['RainToday'].value_counts()
Out[8]:
No     110319
Yes     31880
Name: RainToday, dtype: int64

Exploratory Data Analisys

In [9]:
# correlation between values
corrmat = df.corr()
plt.subplots(figsize=(16,16))
sns.heatmap(corrmat,annot=True,square=True)
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x2094d52e8e0>

As it is obvious in the plot there is high correlation betwen some components. For example, Raonfall has hgh correlation with Min Temp, Cloud, and humidity. the lower the tempreture and the higher the humidity, the higher is the chance of raining

In [10]:
# RainToday values
sns.countplot(data=df, x='RainTomorrow', palette='Pastel1')
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x2094a1b4c40>
In [11]:
# checking the cities with the highest number of rainfalls
df[['Location','Rainfall']].groupby('Location').mean().sort_values(by='Rainfall', ascending=False).iloc[:20]
Out[11]:
Rainfall
Location
Cairns 5.742035
Darwin 5.092452
CoffsHarbour 5.061497
GoldCoast 3.769396
Wollongong 3.594903
Williamtown 3.591108
Townsville 3.485592
NorahHead 3.387299
Sydney 3.324543
MountGinini 3.292260
Katherine 3.201090
Newcastle 3.183892
Brisbane 3.144891
NorfolkIsland 3.127665
SydneyAirport 3.009917
Walpole 2.906846
Witchcliffe 2.895664
Portland 2.530374
Albany 2.263859
BadgerysCreek 2.193101
In [12]:
# scatter plot of humidity in 9 am and Rainfall

fig = px.scatter(df, x="Humidity9am", y="Rainfall",
           marginal_x="box", trendline="ols", template="simple_white", marginal_y="violin")
fig.show()

once more we can see the correlation between humidity and the rainfall. the higher the humidity the higher the amount of rainfall.

In [13]:
# 10 top cities with highest rainfall
fig= px.bar(df[['Location','Rainfall']].groupby('Location').mean().sort_values(by='Rainfall',ascending=False).iloc[:20],
            labels={'value':'Rainfall', 'Location':'City'})

fig.show()

here we can see 10 top cities with highest amount of rain in last 20 years.

Data Cleaning

In [14]:
# defining unique values in each column
for col in df.columns:
    n_unique= df[col].nunique()
    print(f'column name:{col}, number of unique values:{n_unique}')
column name:Location, number of unique values:49
column name:MinTemp, number of unique values:389
column name:MaxTemp, number of unique values:505
column name:Rainfall, number of unique values:681
column name:Evaporation, number of unique values:358
column name:Sunshine, number of unique values:145
column name:WindGustDir, number of unique values:16
column name:WindGustSpeed, number of unique values:67
column name:WindDir9am, number of unique values:16
column name:WindDir3pm, number of unique values:16
column name:WindSpeed9am, number of unique values:43
column name:WindSpeed3pm, number of unique values:44
column name:Humidity9am, number of unique values:101
column name:Humidity3pm, number of unique values:101
column name:Pressure9am, number of unique values:546
column name:Pressure3pm, number of unique values:549
column name:Cloud9am, number of unique values:10
column name:Cloud3pm, number of unique values:10
column name:Temp9am, number of unique values:441
column name:Temp3pm, number of unique values:502
column name:RainToday, number of unique values:2
column name:RainTomorrow, number of unique values:2
In [15]:
# get the number of missing values
missing_values_count= df.isnull().sum()
missing_values_count
Out[15]:
Location             0
MinTemp           1485
MaxTemp           1261
Rainfall          3261
Evaporation      62790
Sunshine         69835
WindGustDir      10326
WindGustSpeed    10263
WindDir9am       10566
WindDir3pm        4228
WindSpeed9am      1767
WindSpeed3pm      3062
Humidity9am       2654
Humidity3pm       4507
Pressure9am      15065
Pressure3pm      15028
Cloud9am         55888
Cloud3pm         59358
Temp9am           1767
Temp3pm           3609
RainToday         3261
RainTomorrow      3267
dtype: int64

There is a lot of missing values which might impact the accuracy of the models. So we need to deal with them before making prediction models

In [16]:
# ploting a heatmap to have a visual about the mising data. most data in Evaporation,Sunshine,Cloud9am,Cloud3pm are missing. there are a few missing data in other columns as well
sns.heatmap(df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x20949d25220>
In [17]:
# How many missing values do we have? 
total_cells = np.product(df.shape)
total_missing = missing_values_count.sum()

# check out the percentage of missing values
percent_missing = (total_missing/total_cells) * 100
print(percent_missing)
10.726097771333574

about 10 percent of the total values are missing

In [18]:
# check how many data we loose if we drop missing values. we almost lose 40 percent of the rows which is not good
print(df.shape)
'\n'
print(df.dropna().shape)
(145460, 22)
(56420, 22)

It turns out if we just drop the missing values we wil miss almost 70 percent of the data.So we need to find better ways to fill the missing data

In [19]:
new_df = df.copy()
In [20]:
# droping 4 columns with high missing values
new_df = df.drop(['Evaporation','Sunshine','Cloud9am','Cloud3pm'], axis=1)
In [21]:
# ploting a heatmap to have a visual about the mising data. most data in Evaporation,Sunshine,Cloud9am,Cloud3pm are missing. there are a few missing data in other columns as well
sns.heatmap(new_df.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x2094a1febe0>
In [22]:
# replacing missing values using interpolate function
new_df = new_df.interpolate()
In [23]:
# there are still some missing values left 
new_df.isnull().sum()
Out[23]:
Location             0
MinTemp              0
MaxTemp              0
Rainfall             0
WindGustDir      10326
WindGustSpeed        0
WindDir9am       10566
WindDir3pm        4228
WindSpeed9am         0
WindSpeed3pm         0
Humidity9am          0
Humidity3pm          0
Pressure9am          0
Pressure3pm          0
Temp9am              0
Temp3pm              0
RainToday         3261
RainTomorrow      3267
dtype: int64

we dropped 4 columns with highest missing values and filled other ones using interpolate function. Still there are missing data which need to be filled. We use the mode of the data to fill the rest.

In [24]:
# replacing missing values with mode
new_df['WindGustDir'].fillna(new_df['WindGustDir'].mode()[0], inplace =True)
new_df['WindDir9am'].fillna(new_df['WindDir9am'].mode()[0], inplace =True)
new_df['WindDir3pm'].fillna(new_df['WindDir3pm'].mode()[0], inplace =True)
In [25]:
# dataset is almost cleaned, yet remained some missing target values
new_df.isnull().sum()
Out[25]:
Location            0
MinTemp             0
MaxTemp             0
Rainfall            0
WindGustDir         0
WindGustSpeed       0
WindDir9am          0
WindDir3pm          0
WindSpeed9am        0
WindSpeed3pm        0
Humidity9am         0
Humidity3pm         0
Pressure9am         0
Pressure3pm         0
Temp9am             0
Temp3pm             0
RainToday        3261
RainTomorrow     3267
dtype: int64
In [26]:
# dropping rest of the missing values
new_df.dropna(inplace=True)
In [27]:
# there is no missing value in the dataset
new_df.isnull().sum()
Out[27]:
Location         0
MinTemp          0
MaxTemp          0
Rainfall         0
WindGustDir      0
WindGustSpeed    0
WindDir9am       0
WindDir3pm       0
WindSpeed9am     0
WindSpeed3pm     0
Humidity9am      0
Humidity3pm      0
Pressure9am      0
Pressure3pm      0
Temp9am          0
Temp3pm          0
RainToday        0
RainTomorrow     0
dtype: int64

Data cleaning is done and there is no missing values in dataset

Data Labeling

in this part we have to convert the text data into labels to be able to fir them in ML models. there are four columns with text values. Once they are converted, they text columns will be dropped and the labeled columns will be concatenated with the dataframe.

In [28]:
# converting text to numerical values
Location = pd.get_dummies(new_df['Location'],drop_first=True)
WindGustDir = pd.get_dummies(new_df['WindGustDir'],drop_first=True)
WindDir9am = pd.get_dummies(new_df['WindDir9am'],drop_first=True)
WindDir3pm = pd.get_dummies(new_df['WindDir9am'],drop_first=True)
In [29]:
new_df.drop(['Location','WindGustDir','WindDir9am','WindDir3pm'],axis=1,inplace=True)
In [30]:
new_df = pd.concat([new_df,Location,WindGustDir,WindDir9am,WindDir3pm], axis=1)
In [31]:
new_df[['RainToday','RainTomorrow']]
Out[31]:
RainToday RainTomorrow
Date
2008-12-01 No No
2008-12-02 No No
2008-12-03 No No
2008-12-04 No No
2008-12-05 No No
... ... ...
2017-06-20 No No
2017-06-21 No No
2017-06-22 No No
2017-06-23 No No
2017-06-24 No No

140787 rows × 2 columns

As we see the two above columns have text values. to convert them to numerical values we use a function to convert 'Yes' to '1' and 'No' to '0'

In [32]:
def rain(data):
    
    if data == 'Yes':
        return 1
    else:
        return 0
In [33]:
new_df['RainToday']= new_df['RainToday'].apply(rain)
new_df['RainTomorrow']= new_df['RainTomorrow'].apply(rain)

splitting the data to test and train set

In [34]:
X_train, X_test, y_train, y_test = train_test_split(new_df.drop('RainTomorrow', axis=1),new_df['RainTomorrow'],
                                                    test_size=0.30)

Here we test the 5 ML models to choose the models with the highest accuracy

In [38]:
models = []
models.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('RF', RandomForestClassifier()))
models.append(('K_means', KMeans(n_clusters=2)))

#evaluate each model in turn
results = []
names = []
for name, model in models:
	kfold = StratifiedKFold(n_splits=10, random_state=1, shuffle=True)
	cv_results = cross_val_score(model, X_train, y_train, cv=kfold, scoring='accuracy')
	results.append(cv_results)
	names.append(name)
	print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))
LR: 0.846738 (0.002959)
LDA: 0.847661 (0.002243)
KNN: 0.833871 (0.002633)
CART: 0.786565 (0.002861)
NB: 0.616124 (0.002485)
RF: 0.854896 (0.002543)
K_means: 0.181908 (0.026481)

So, it turns out the 'LDA' and 'LR' and 'RF' have highest accuracy.

Fitting data into LogisticRegression algorithm, making prediction model and printing out the results.

In [46]:
logmodel = LogisticRegression()
logmodel.fit(X_train,y_train)
C:\Users\Amir\anaconda3\lib\site-packages\sklearn\linear_model\_logistic.py:762: ConvergenceWarning:

lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

Out[46]:
LogisticRegression()
In [47]:
predictions = logmodel.predict(X_test)
In [48]:
print(classification_report(y_test,predictions))
print('\n')
print(confusion_matrix(y_test,predictions))
              precision    recall  f1-score   support

           0       0.85      0.95      0.90     32784
           1       0.72      0.44      0.54      9453

    accuracy                           0.84     42237
   macro avg       0.79      0.69      0.72     42237
weighted avg       0.82      0.84      0.82     42237



[[31163  1621]
 [ 5308  4145]]

Making models using LinearDiscriminantAnalysis

In [49]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
In [50]:
clf = LinearDiscriminantAnalysis()
In [51]:
clf.fit(X_train, y_train)
Out[51]:
LinearDiscriminantAnalysis()
In [52]:
clf_predict = clf.predict(X_test)
In [53]:
print(confusion_matrix(y_test,clf_predict))
print(classification_report(y_test,clf_predict))
[[30887  1897]
 [ 4739  4714]]
              precision    recall  f1-score   support

           0       0.87      0.94      0.90     32784
           1       0.71      0.50      0.59      9453

    accuracy                           0.84     42237
   macro avg       0.79      0.72      0.74     42237
weighted avg       0.83      0.84      0.83     42237

Making Model using RandomForest

In [ ]:
from sklearn.ensemble import RandomForestClassifier
In [54]:
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
Out[54]:
RandomForestClassifier()
In [55]:
rfc_pred = rfc.predict(X_test)
In [56]:
print(confusion_matrix(y_test,rfc_pred))
print(classification_report(y_test,rfc_pred))
[[31430  1354]
 [ 4938  4515]]
              precision    recall  f1-score   support

           0       0.86      0.96      0.91     32784
           1       0.77      0.48      0.59      9453

    accuracy                           0.85     42237
   macro avg       0.82      0.72      0.75     42237
weighted avg       0.84      0.85      0.84     42237

Making Models using DecisiontreeClassifier

In [ ]:
from sklearn.tree import DecisionTreeClassifier
In [57]:
dtree = DecisionTreeClassifier()
In [58]:
dtree.fit(X_train,y_train)
Out[58]:
DecisionTreeClassifier()
In [59]:
dt_pred = dtree.predict(X_test)
In [60]:
print(confusion_matrix(y_test,dt_pred))
print(classification_report(y_test,dt_pred))
[[28243  4541]
 [ 4444  5009]]
              precision    recall  f1-score   support

           0       0.86      0.86      0.86     32784
           1       0.52      0.53      0.53      9453

    accuracy                           0.79     42237
   macro avg       0.69      0.70      0.69     42237
weighted avg       0.79      0.79      0.79     42237

KNN Clustering

In [42]:
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train,y_train)
pred = knn.predict(X_test)
print(confusion_matrix(y_test,pred))
print(classification_report(y_test,pred))
[[28556  4228]
 [ 4630  4823]]
              precision    recall  f1-score   support

           0       0.86      0.87      0.87     32784
           1       0.53      0.51      0.52      9453

    accuracy                           0.79     42237
   macro avg       0.70      0.69      0.69     42237
weighted avg       0.79      0.79      0.79     42237

In [43]:
error_rate = []

# Will take some time
for i in range(1,40):
    
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
In [44]:
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')
Out[44]:
Text(0, 0.5, 'Error Rate')
In [45]:
knn = KNeighborsClassifier(n_neighbors=30)
knn.fit(X_train,y_train)
pred = knn.predict(X_test)
print(confusion_matrix(y_test,pred))
print(classification_report(y_test,pred))
[[31418  1366]
 [ 5248  4205]]
              precision    recall  f1-score   support

           0       0.86      0.96      0.90     32784
           1       0.75      0.44      0.56      9453

    accuracy                           0.84     42237
   macro avg       0.81      0.70      0.73     42237
weighted avg       0.83      0.84      0.83     42237

Conclusion

Problem Statement The dataset demonstrates the different weather components in the last 20 years. The problem is to make the best model based on the weather components to predict whether it is rainy tomorrow or not. If we make this model, we will make heads up about weather conditions for people who live around.

Exploratory Data Analysis We did some EDA to figure out the data relationships. The plots demonstrate the cities with the highest rainfalls. Moreover, there is a high correlation between humidity in 9 am and the cloud with the rainfall amount.

Data Cleaning Since they are many missing values, we had to clean the data before making models. To do so, we dropped four columns with the highest missing values. Then, we used interpolate function to fill na values. To fill the remained missing values, we took the data mode and filled the na spots with it.

Training models and make predictions We tested five different ML algorithms. The LR and LDA had the highest accuracy with 0.84 percent. So we fit the data in those models and the Decision Tree model and checked the precision and f1 scores of each of them. To cluster, we used KNN algorithm. To find the best number of the clusters, we used elbow method and plotted the error rate agains K values. So, th ebest K value is 30, and we reached 0.84 accuracy.

In [ ]: